Pandas Data Analysis

安装 pip


later use utf-8-sig instead of utf-8

pip install --target=d:\somewhere\other\than\the\default package_name
带空格处理
pip install --target=X:\My~Documents\py_pkg\ selenium

进行分组统计

import pandas as pd
df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
     'key2':['one', 'two', 'one', 'two', 'one'],
     'data1':np.random.randn(5),
     'data2':np.random.randn(5)})
df
      data1     data2 key1 key2
0 -0.410673  0.519378    a  one
1 -2.120793  0.199074    a  two
2  0.642216 -0.143671    b  one
3  0.975133 -0.592994    b  two
4 -1.017495 -0.530459    a  one

grouped = df['data1'].groupby(df['key1'])

grouped.mean()
key1
a      -1.182987
b       0.808674
dtype: float64

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means.unstack()
key2       one       two
key1                    
a    -0.714084 -2.120793
b     0.642216  0.975133

选取两个dataframe中列公共的元素

有df1.ID, df2.ID 两列，我需要选取两列中都有的元素。
可以用如下方法：

import pandas as pd
m_df=pd.merge(df1,df2,how="inner",on="ID")
ind=m_df.ID
df1.set_index('ID',inplace=True)
df2.set_index('ID',inplace=True)
common_df1=df1.loc[ind,:]
common_df2=df2.loc[ind,:]

改变pandas 列的类型

把数值型的改变成string

import pandas as pd
df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
     'key2':['one', 'two', 'one', 'two', 'one'],
     'data1':np.random.randn(5),
     'data2':np.random.randn(5)})

 df["data1"]=df["data1"].astype(str)

寻找缺失值或者NaN值所带的位置

                   a         b
2011-01-01 00:00:00 1.883381  -0.416629
2011-01-01 01:00:00 0.149948  -1.782170
2011-01-01 02:00:00 -0.407604 0.314168
2011-01-01 03:00:00 1.452354  NaN
2011-01-01 04:00:00 -1.224869 -0.947457
2011-01-01 05:00:00 0.498326  0.070416
2011-01-01 06:00:00 0.401665  NaN
2011-01-01 07:00:00 -0.019766 0.533641
2011-01-01 08:00:00 -1.101303 -1.408561
2011-01-01 09:00:00 1.671795  -0.764629


import numpy as np
index = df['b'].index[df['b'].apply(np.isnan)]

df['a'].ix[index[0]]
>>> 1.452354

df_index = df.index.values.tolist()
[df_index.index(i) for i in index]
>>> [3, 6]

iloc 与 loc 区别 at 与 iat 区别 ix

dataframe.iloc按下标选取，或者使用dataframe.loc按索引选取：

如果不是需要访问特定行列，而只是某个特殊位置的元素的话，dataframe.at和dataframe.iat是最快的方式，它们分别用于使用索引和下标进行访问：

ataframe.ix可以混合使用索引和下标进行访问，唯一需要注意的地方是行列内部需要一致，不可以同时使用索引和标签访问行或者列，不然的话，将会得到意外的结果。

pandas 处理缺失数据

原始数据的中很可能存在一些数据的缺失，就如同现在处理的这个样例数据一样，处理缺失数据有多种方式。通常使用dataframe.dropna()，dataframe.dropna()可以按行丢弃带有nan的数据；若指定how=’all’（默认是’any’），则只在整行全部是nan时丢弃数据；若指定thresh，则表示当某行数据非缺失列数超过指定数值时才保留；要指定根据某列丢弃可以通过subset完成。

print "Data size before filtering:"
print df.shape

print "Drop all rows that have any NaN values:"
print "Data size after filtering:"
print df.dropna().shape
print df.dropna().head(10)

print "Drop only if all columns are NaN:"
print "Data size after filtering:"
print df.dropna(how='all').shape
print df.dropna(how='all').head(10)

print "Drop rows who do not have at least six values that are not NaN"
print "Data size after filtering:"
print df.dropna(thresh=6).shape
print df.dropna(thresh=6).head(10)

print "Drop only if NaN in specific column:"
print "Data size after filtering:"
print df.dropna(subset=['closePrice']).shape
print df.dropna(subset=['closePrice']).head(10)